Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

نویسندگان

Huaiping Ming

Dong-Yan Huang

Lei Xie

Jie Wu

Minghui Dong

Haizhou Li

چکیده

Emotional voice conversion aims at converting speech from one emotion state to another. This paper proposes to model timbre and prosody features using a deep bidirectional long shortterm memory (DBLSTM) for emotional voice conversion. A continuous wavelet transform (CWT) representation of fundamental frequency (F0) and energy contour are used for prosody modeling. Specifically, we use CWT to decompose F0 into a five-scale representation, and decompose energy contour into a ten-scale representation, where each feature scale corresponds to a temporal scale. Both spectrum and prosody (F0 and energy contour) features are simultaneously converted by a sequence to sequence conversion method with DBLSTM model, which captures both frame-wise and long-range relationship between source and target voice. The converted speech signals are evaluated both objectively and subjectively, which confirms the effectiveness of the proposed method.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform

An artificial neural network is one of the most important models for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefficients (MCC) which represent the spectrum features. However, a simple representation for fundamental frequency (F0) is not enough for neural networks to deal with an...

متن کامل

Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform

An artificial neural network is an important model for training features of voice conversion (VC) tasks. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as Mel Cepstral Coefficients (MCC), which represent the spectrum features. However, a simple representation of fundamental frequency (F0) is not enough for NNs to deal with emotional voice VC. This is ...

متن کامل

Emotional Voice Conversion with Adaptive Scales F0 Based on Wavelet Transform Using Limited Amount of Emotional Data

Deep learning techniques have been successfully applied to speech processing. Typically, neural networks (NNs) are very effective in processing nonlinear features, such as mel cepstral coefficients (MCC), which represent the spectrum features in voice conversion (VC) tasks. Despite these successes, the approach is restricted to problems with moderate dimension and sufficient data. Thus, in emot...

متن کامل

Statistical approach to perceived age control of singing voice

The perceived age of a singing voice is the age of the singer as perceived by the listener, and is one of the notable characteristics that determines perceptions of a song. Singers can sing expressively by controlling prosody and voice timbre, but the varieties of voice timbre that singers can produce are limited by physical constraints. Previous work has attempted to overcome the limitation th...

متن کامل

Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

The perceived age of a singing voice is the age of the singer as perceived by the listener, and is one of the notable characteristics that determines perceptions of a song. In this paper, we describe an investigation of acoustic features that have an effect on the perceived age, and a novel voice timbre control technique based on the perceived age for singing voice conversion (SVC). Singers can...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

Deep Bidirectional LSTM Modeling of Timbre and Prosody for Emotional Voice Conversion

نویسندگان

چکیده

منابع مشابه

Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform

Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform

Emotional Voice Conversion with Adaptive Scales F0 Based on Wavelet Transform Using Limited Amount of Emotional Data

Statistical approach to perceived age control of singing voice

Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

عنوان ژورنال:

اشتراک گذاری